library(tidyverse) 
set.seed(1)  
theme_set(theme_minimal()) 
options(scipen=15, digits = 5)

There are several new packages used in this tutorial. I note throughout where each one comes in, so you can see which functions come from which package, but here's the full list:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(corrplot)
## corrplot 0.84 loaded
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(ppcor)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
source("cor.mtest.R")

Here we're using semi-toy data: not a real study, but the relationships between the visual features come from real data.

In this fake study, participants saw an image for 1, 2, or 3 seconds (Duration, continuous IV). They had to rate on a scale of 1-7 how much they liked the image (Preference, continuous DV). Each image was previously rated for how natural it was (Natural, continuous IV, scale of 1-7). Towards the end, we'll add on other image feature variables.

This is not meant as a sample analysis pipeline. Rather, this tutorial demonstrates various tools in R for regression analysis and shows how to calculate certain values by hand, so you understand where R's output comes from.

natprefdata <- read_csv("RegressionData1.csv")
## Parsed with column specification:
## cols(
##   Subject = col_double(),
##   Image = col_double(),
##   Duration = col_double(),
##   Natural = col_double(),
##   Preference = col_double()
## )

First we’ll visualize the data.

ggplot(natprefdata) + aes(x = Natural, y = Preference, color=Duration) + geom_point()

We can also summarize the data to the “by image” level before visualizing, collapsing across Duration.

byImage <- natprefdata %>% group_by(Image) %>% summarise(meanPref=mean(Preference),Natural=mean(Natural))

p1 <- ggplot(byImage) + aes(x = Natural, y = meanPref) + 
  geom_point() + 
  coord_cartesian(xlim=c(1,7.5), ylim=c(2,7))
p1

Simple linear regression

Does Naturalness predict Preference?

Calculating the best-fit line from scratch. The slope is m = sum((x - meanX)*(y - meanY)) / sum((x - meanX)^2); plugging the means into y = mx + b then gives the intercept.

meanX <- mean(byImage$Natural)
meanX
## [1] 4.7
meanY <- mean(byImage$meanPref)
meanY
## [1] 5.0733
byImage2 <- byImage %>% 
  mutate(minusMeanX = Natural-meanX,
         minusMeanY = meanPref-meanY,
         productminusmeans = minusMeanX*minusMeanY,
         minusMeanXsqrd = (Natural-meanX)^2)
head(byImage2)
## # A tibble: 6 x 7
##   Image meanPref Natural minusMeanX minusMeanY productminusmeans minusMeanXsqrd
##   <dbl>    <dbl>   <dbl>      <dbl>      <dbl>             <dbl>          <dbl>
## 1     1     3.79    6.78       2.08     -1.28             -2.66            4.33
## 2     2     4.24    6.81       2.11     -0.833            -1.76            4.45
## 3     3     4.52    6.52       1.82     -0.557            -1.01            3.31
## 4     4     4.54    6.79       2.09     -0.534            -1.12            4.37
## 5     5     4.88    6.67       1.97     -0.192            -0.377           3.88
## 6     6     4.93    6.63       1.93     -0.145            -0.279           3.72
m <- sum(byImage2$productminusmeans)/sum(byImage2$minusMeanXsqrd)
m
## [1] 0.13826
# Solve y = mx + b at the means for the intercept: b = meanY - m*meanX
b <- meanY - m*meanX
b
## [1] 4.4235

Our best-fit line is Preference = 4.42 + 0.14*Naturalness. Let's add that to our plot. With the error term included, the model is written Y = 4.42 + 0.14*Naturalness + epsilon.

p1 + geom_abline(slope = m, intercept = b)

What does R give us for the linear regression?

lm1 <- lm(meanPref ~ Natural, data = byImage)
summary(lm1)
## 
## Call:
## lm(formula = meanPref ~ Natural, data = byImage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1790 -0.4237  0.0164  0.6326  1.3257 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   4.4235     0.2519   17.56 <0.0000000000000002 ***
## Natural       0.1383     0.0492    2.81              0.0067 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.777 on 58 degrees of freedom
## Multiple R-squared:  0.12,   Adjusted R-squared:  0.105 
## F-statistic:  7.9 on 1 and 58 DF,  p-value: 0.00672

What does that output mean?

Call: shows you the formula you used in the regression

Residuals: Quantiles of the residuals

Coefficients: shows the estimates, standard errors, t-values, and p-values for the coefficients (betas) in your model, including the y-intercept (b)

Residual standard error: sqrt(MSE) with n-p degrees of freedom

R-squared: SSR/SST

Adjusted R^2: 1 - [(1-R^2)(n-1)/(n-k-1)], where n = number of observations and k = number of predictors (not counting the intercept)

F-statistic: MSR/MSE
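Each of these formulas can be checked against R's output directly. Here's a minimal sketch on synthetic data (the same checks work on lm1 or any other lm fit):

```r
# Verify summary() quantities by hand on a toy regression
set.seed(1)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

n <- length(y)   # observations
p <- 2           # parameters (intercept + slope)

SSE <- sum(residuals(fit)^2)
SST <- sum((y - mean(y))^2)
SSR <- SST - SSE

rse   <- sqrt(SSE / (n - p))                  # Residual standard error
rsq   <- SSR / SST                            # Multiple R-squared
adjr  <- 1 - (1 - rsq) * (n - 1) / (n - p)    # Adjusted R-squared (k = p - 1)
fstat <- (SSR / (p - 1)) / (SSE / (n - p))    # F = MSR/MSE

all.equal(rse,  summary(fit)$sigma)
all.equal(rsq,  summary(fit)$r.squared)
all.equal(adjr, summary(fit)$adj.r.squared)
all.equal(fstat, unname(summary(fit)$fstatistic["value"]))
```

All four comparisons come back TRUE, confirming that summary() is computing exactly these formulas.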

What would happen to the slope and intercept if we used the original trial-level dataset? Their values stay the same, but the SEs and p-values shrink because we now have many more observations.

lm2 <- lm(Preference ~ Natural, data = natprefdata)
summary(lm2)
## 
## Call:
## lm(formula = Preference ~ Natural, data = natprefdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6367 -0.4911  0.0652  0.5976  1.7124 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   4.4235     0.0590    74.9 <0.0000000000000002 ***
## Natural       0.1383     0.0115    12.0 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.814 on 1198 degrees of freedom
## Multiple R-squared:  0.107,  Adjusted R-squared:  0.107 
## F-statistic:  144 on 1 and 1198 DF,  p-value: <0.0000000000000002

Reminder of how to get built-in R best fit line on the plot:

p1 + geom_smooth(method="lm", formula = "y ~ x")

What do the residuals look like?

plot(lm1$fitted.values, lm1$residuals)

Multiple linear regression

Do Naturalness and viewing Duration predict Preference?

First we will standardize our variables. Remember, you can only directly compare the sizes of betas when the variables have been standardized. Here we compute z-scores manually; later in the tutorial you'll see that the function scale() does this for you.

#Standardizing variables

natprefdata_zscores <- natprefdata %>% 
  mutate(DurationZScore = (Duration - mean(Duration))/sd(Duration),
         NaturalZScore = (Natural - mean(Natural))/sd(Natural),
         PreferenceZScore = (Preference - mean(Preference))/sd(Preference))
natprefdata_zscores
## # A tibble: 1,200 x 8
##    Subject Image Duration Natural Preference DurationZScore NaturalZScore
##      <dbl> <dbl>    <dbl>   <dbl>      <dbl>          <dbl>         <dbl>
##  1       1     1        3    6.78       3.75           1.22         1.02 
##  2       1     2        1    6.81       4.58          -1.22         1.03 
##  3       1     3        3    6.52       4.42           1.22         0.892
##  4       1     4        1    6.79       4.75          -1.22         1.02 
##  5       1     5        3    6.67       4.31           1.22         0.966
##  6       1     6        3    6.63       4.63           1.22         0.946
##  7       1     7        3    6.76       6.69           1.22         1.01 
##  8       1     8        1    6.8        5.02          -1.22         1.03 
##  9       1     9        2    4.05       5.16           0           -0.319
## 10       1    10        1    6.59       4.93          -1.22         0.926
## # … with 1,190 more rows, and 1 more variable: PreferenceZScore <dbl>

Linear model

multreg1 <- lm(PreferenceZScore ~ NaturalZScore + DurationZScore, data=natprefdata_zscores)
summary(multreg1)
## 
## Call:
## lm(formula = PreferenceZScore ~ NaturalZScore + DurationZScore, 
##     data = natprefdata_zscores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1077 -0.5680  0.0625  0.6848  2.0399 
## 
## Coefficients:
##                             Estimate            Std. Error t value
## (Intercept)    -0.000000000000000299  0.027272888807801814     0.0
## NaturalZScore   0.328832581163430393  0.027299038924182395    12.1
## DurationZScore -0.040860366402140795  0.027299038924182405    -1.5
##                           Pr(>|t|)    
## (Intercept)                   1.00    
## NaturalZScore  <0.0000000000000002 ***
## DurationZScore                0.13    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.945 on 1197 degrees of freedom
## Multiple R-squared:  0.109,  Adjusted R-squared:  0.107 
## F-statistic: 73.2 on 2 and 1197 DF,  p-value: <0.0000000000000002
#Confidence intervals for the beta
confint.lm(multreg1)
##                    2.5 %   97.5 %
## (Intercept)    -0.053508 0.053508
## NaturalZScore   0.275273 0.382392
## DurationZScore -0.094420 0.012699

Plotting the residuals

plot(multreg1$fitted.values, multreg1$residuals)

Other residuals analysis

We can look at the residuals visually for their spread and normality, as well as run statistical tests for heteroscedasticity.

plot(multreg1)

hist(multreg1$residuals)

#library(lmtest)
#Breusch-Pagan Test
bptest(multreg1)
## 
##  studentized Breusch-Pagan test
## 
## data:  multreg1
## BP = 3.69, df = 2, p-value = 0.16
bptest(multreg1, varformula = ~ fitted.values(multreg1))
## 
##  studentized Breusch-Pagan test
## 
## data:  multreg1
## BP = 3.64, df = 1, p-value = 0.056
#library(car)
# Score Test for Non-Constant Error Variance 
ncvTest(multreg1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 3.5867, Df = 1, p = 0.0582

I think this StackExchange answer provides a clear explanation of what these tests are doing differently: https://stats.stackexchange.com/a/261846
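To see what bptest() is doing under the hood, note that the studentized (Koenker) Breusch-Pagan statistic reduces to n times the R-squared from regressing the squared residuals on the chosen variance predictors. A by-hand sketch on synthetic heteroscedastic data (not the tutorial data), using the fitted values as the variance predictor:

```r
# Studentized Breusch-Pagan test "by hand"
set.seed(1)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.5 + x)  # error variance grows with x
fit <- lm(y ~ x)

u2  <- residuals(fit)^2                    # squared residuals
aux <- lm(u2 ~ fitted(fit))                # auxiliary regression
BP  <- length(y) * summary(aux)$r.squared  # studentized BP statistic
pval <- pchisq(BP, df = 1, lower.tail = FALSE)  # df = number of aux regressors
c(BP = BP, p = pval)
```

This mirrors bptest(model, varformula = ~ fitted.values(model)); ncvTest() uses the non-studentized score version of the same idea, which is why its numbers are close but not identical.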

Partitioning the variance

#SSE - sum((actual-predicted)^2)
natprefdata_zscores$Predicted <- multreg1$fitted.values
SSE <- natprefdata_zscores %>% 
  mutate(error = PreferenceZScore - Predicted,
         errorsq = error^2) %>% 
  summarise(SSE = sum(errorsq)) %>% 
  pull(SSE)
SSE
## [1] 1068.4
#Another way
SSE1 <- sum((multreg1$residuals)^2)

#SST - sum((actual-mean)^2)
meanPref <- mean(natprefdata_zscores$PreferenceZScore)
SST <- natprefdata_zscores %>% 
  mutate(difffrommean = PreferenceZScore - meanPref,
         diffsq = difffrommean^2) %>% 
  summarise(SST = sum(diffsq)) %>% 
  pull(SST)
SST
## [1] 1199
#Another way
SST1 <- sum((natprefdata_zscores$PreferenceZScore - mean(natprefdata_zscores$PreferenceZScore))^2)

#SSR - SST-SSE
SSR = SST-SSE 
SSR
## [1] 130.59

Partitioning variance of reduced models

# SSR just for naturalness
justnat <- lm(PreferenceZScore ~ NaturalZScore, data=natprefdata_zscores)
Red_SSR_justnat <- sum((justnat$fitted.values - mean(natprefdata_zscores$PreferenceZScore))^2)
Red_SSR_justnat
## [1] 128.59
# SSR just for duration
justduration <- lm(PreferenceZScore ~ DurationZScore, data=natprefdata_zscores)
Red_SSR_justdur <- sum((justduration$fitted.values - mean(natprefdata_zscores$PreferenceZScore))^2)
Red_SSR_justdur
## [1] 1.0821
#Unique SSRs
Uniq_nat_SSR = SSR - Red_SSR_justdur
Uniq_dur_SSR = SSR - Red_SSR_justnat

# Shared SSR
Shared_SSR = SSR - Uniq_dur_SSR - Uniq_nat_SSR
Shared_SSR
## [1] -0.91756
# Shared SSR is essentially zero (even slightly negative) because Duration and Naturalness are nearly uncorrelated.
cor(natprefdata_zscores$DurationZScore, natprefdata_zscores$NaturalZScore)
## [1] 0.032901

Calculating R-squared and adjusted R-squared by hand

#R^2 = SSR/SST
Rsq = SSR/SST
Rsq
## [1] 0.10892
# Adj R^2 = 1- [(1-R^2)(n-1)/(n-k-1)] where n = obs, k=num variables in model, not constants
n <- 1200
k <- 2
Adj_Rsq = 1 - ((1-Rsq)*(n-1)/(n-k-1))
Adj_Rsq
## [1] 0.10743

Cohen’s f-squared by hand

R doesn't have a built-in function for Cohen's f-squared, but it's pretty easy to calculate by hand.

# R^2 /(1 - R^2)

cohen_fsq = Rsq/(1-Rsq)
cohen_fsq
## [1] 0.12223

F-statistic

p <- 3
#MSE = SSE/df_err (df error = n-p)
MSE <- SSE/(n-p)
MSE
## [1] 0.89257
#Note from the 'summary': Residual standard error was 0.945; that's the square root of MSE.
sqrt(MSE)
## [1] 0.94476
#MST = SST/df_tot (df total = n-1) n=observations
MST <- SST/(n-1)
MST
## [1] 1
#MSR = SSR/df_reg (df regression = p-1) p=num parameters
MSR <- SSR/(p-1)
MSR
## [1] 65.295
#F = MSR/MSE 
F <- MSR/MSE
F
## [1] 73.154
# p-value for F-statistic
pval_F <- pf(F, p-1, n-p, lower.tail = FALSE)
pval_F
## [1] 1.0622e-30

What if each trial couldn't be used as a continuous DV (hint: like Hit Rate for our pilot study)?

image_cond <- natprefdata_zscores %>% 
  group_by(Image, Duration) %>% 
  summarize(meanPref = mean(PreferenceZScore), Natural=mean(NaturalZScore))
image_cond
## # A tibble: 180 x 4
## # Groups:   Image [60]
##    Image Duration meanPref Natural
##    <dbl>    <dbl>    <dbl>   <dbl>
##  1     1        1   -1.40    1.02 
##  2     1        2   -1.43    1.02 
##  3     1        3   -1.55    1.02 
##  4     2        1   -1.04    1.03 
##  5     2        2   -1.08    1.03 
##  6     2        3   -0.852   1.03 
##  7     3        1   -0.600   0.892
##  8     3        2   -0.647   0.892
##  9     3        3   -0.683   0.892
## 10     4        1   -0.430   1.02 
## # … with 170 more rows
#Then we can run our lm on this dataset.

Changing the model

Interaction Terms

#Adding an interaction term between duration and naturalness
multreg_interaction <- lm(PreferenceZScore ~ NaturalZScore + DurationZScore + NaturalZScore*DurationZScore, data=natprefdata_zscores)
summary(multreg_interaction)
## 
## Call:
## lm(formula = PreferenceZScore ~ NaturalZScore + DurationZScore + 
##     NaturalZScore * DurationZScore, data = natprefdata_zscores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0878 -0.5664  0.0668  0.6922  2.0160 
## 
## Coefficients:
##                               Estimate Std. Error t value            Pr(>|t|)
## (Intercept)                   0.000497   0.027295    0.02                0.99
## NaturalZScore                 0.328803   0.027307   12.04 <0.0000000000000002
## DurationZScore               -0.041003   0.027308   -1.50                0.13
## NaturalZScore:DurationZScore -0.015133   0.027025   -0.56                0.58
##                                 
## (Intercept)                     
## NaturalZScore                ***
## DurationZScore                  
## NaturalZScore:DurationZScore    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.945 on 1196 degrees of freedom
## Multiple R-squared:  0.109,  Adjusted R-squared:  0.107 
## F-statistic: 48.8 on 3 and 1196 DF,  p-value: <0.0000000000000002
#You can also do this to get the same output, but it's less explicit about the full model:
multreg_interaction1 <- lm(PreferenceZScore ~ NaturalZScore*DurationZScore, data=natprefdata_zscores)
summary(multreg_interaction1)
## 
## Call:
## lm(formula = PreferenceZScore ~ NaturalZScore * DurationZScore, 
##     data = natprefdata_zscores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0878 -0.5664  0.0668  0.6922  2.0160 
## 
## Coefficients:
##                               Estimate Std. Error t value            Pr(>|t|)
## (Intercept)                   0.000497   0.027295    0.02                0.99
## NaturalZScore                 0.328803   0.027307   12.04 <0.0000000000000002
## DurationZScore               -0.041003   0.027308   -1.50                0.13
## NaturalZScore:DurationZScore -0.015133   0.027025   -0.56                0.58
##                                 
## (Intercept)                     
## NaturalZScore                ***
## DurationZScore                  
## NaturalZScore:DurationZScore    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.945 on 1196 degrees of freedom
## Multiple R-squared:  0.109,  Adjusted R-squared:  0.107 
## F-statistic: 48.8 on 3 and 1196 DF,  p-value: <0.0000000000000002
#Or if for some reason you only want the interaction term:
multreg_interaction2 <- lm(PreferenceZScore ~ NaturalZScore:DurationZScore, data=natprefdata_zscores)
summary(multreg_interaction2)
## 
## Call:
## lm(formula = PreferenceZScore ~ NaturalZScore:DurationZScore, 
##     data = natprefdata_zscores)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.426 -0.623  0.106  0.673  2.258 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   0.000509   0.028891    0.02     0.99
## NaturalZScore:DurationZScore -0.015480   0.028604   -0.54     0.59
## 
## Residual standard error: 1 on 1198 degrees of freedom
## Multiple R-squared:  0.000244,   Adjusted R-squared:  -0.00059 
## F-statistic: 0.293 on 1 and 1198 DF,  p-value: 0.588

Random Effects

#library(lme4)

#Adding random intercepts
multreg_randomimage <- lmer(PreferenceZScore ~ NaturalZScore + DurationZScore + (1|Image), data=natprefdata_zscores)
summary(multreg_randomimage)
## Linear mixed model fit by REML ['lmerMod']
## Formula: PreferenceZScore ~ NaturalZScore + DurationZScore + (1 | Image)
##    Data: natprefdata_zscores
## 
## REML criterion at convergence: 1079.9
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.201 -0.618 -0.006  0.641  3.113 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Image    (Intercept) 0.808    0.899   
##  Residual             0.111    0.334   
## Number of obs: 1200, groups:  Image, 60
## 
## Fixed effects:
##                            Estimate           Std. Error t value
## (Intercept)    -0.00000000000000454  0.11644840661170815    0.00
## NaturalZScore   0.32741752345271813  0.11649741308696240    2.81
## DurationZScore  0.00214915488942733  0.00990663998136926    0.22
## 
## Correlation of Fixed Effects:
##             (Intr) NtrlZS
## NaturalZScr  0.000       
## DuratinZScr  0.000 -0.003
multreg_randomimage_AIC <- lmer(PreferenceZScore ~ NaturalZScore + DurationZScore + (1|Image), data=natprefdata_zscores, REML=FALSE)
summary(multreg_randomimage_AIC)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: PreferenceZScore ~ NaturalZScore + DurationZScore + (1 | Image)
##    Data: natprefdata_zscores
## 
##      AIC      BIC   logLik deviance df.resid 
##   1077.5   1103.0   -533.8   1067.5     1195 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.202 -0.619 -0.006  0.641  3.115 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Image    (Intercept) 0.781    0.884   
##  Residual             0.111    0.334   
## Number of obs: 1200, groups:  Image, 60
## 
## Fixed effects:
##                            Estimate           Std. Error t value
## (Intercept)    -0.00000000000000301  0.11449070165759732    0.00
## NaturalZScore   0.32741786910113940  0.11453889929624642    2.86
## DurationZScore  0.00213864918852664  0.00990223236594340    0.22
## 
## Correlation of Fixed Effects:
##             (Intr) NtrlZS
## NaturalZScr  0.000       
## DuratinZScr  0.000 -0.003
AIC1 <- 1077.5

null_randomimage_AIC <- lmer(PreferenceZScore ~ 1 + (1|Image), data=natprefdata_zscores, REML=FALSE)
summary(null_randomimage_AIC)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: PreferenceZScore ~ 1 + (1 | Image)
##    Data: natprefdata_zscores
## 
##      AIC      BIC   logLik deviance df.resid 
##   1081.2   1096.5   -537.6   1075.2     1197 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.202 -0.620 -0.006  0.645  3.128 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Image    (Intercept) 0.888    0.942   
##  Residual             0.111    0.334   
## Number of obs: 1200, groups:  Image, 60
## 
## Fixed effects:
##                      Estimate        Std. Error t value
## (Intercept) 0.000000000000004 0.122028809503485       0
nullAIC <- 1081.2
#Can also use AIC to compare alternative models to each other, instead of just the null.
(deltaAIC <- AIC1-nullAIC)
## [1] -3.7
anova(multreg_randomimage_AIC, null_randomimage_AIC)
## Data: natprefdata_zscores
## Models:
## null_randomimage_AIC: PreferenceZScore ~ 1 + (1 | Image)
## multreg_randomimage_AIC: PreferenceZScore ~ NaturalZScore + DurationZScore + (1 | Image)
##                         Df  AIC  BIC logLik deviance Chisq Chi Df Pr(>Chisq)  
## null_randomimage_AIC     3 1081 1096   -538     1075                          
## multreg_randomimage_AIC  5 1078 1103   -534     1068  7.71      2      0.021 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(multreg_randomimage_AIC)
## Computing profile confidence intervals ...
##                    2.5 %   97.5 %
## .sig01          0.745328 1.070095
## .sigma          0.320323 0.347735
## (Intercept)    -0.228038 0.228038
## NaturalZScore   0.099283 0.555551
## DurationZScore -0.017287 0.021562

Categorical Variables

#Changing naturalness to a categorical variable

#Two categories
natprefdata_zscores <- natprefdata_zscores %>% 
  mutate(NatCategory = case_when(
    NaturalZScore > 0 ~ "High",
    NaturalZScore < 0 ~ "Low"
  ))

lm_categorical <- lm(PreferenceZScore ~ NatCategory + DurationZScore, data=natprefdata_zscores)
summary(lm_categorical)
## 
## Call:
## lm(formula = PreferenceZScore ~ NatCategory + DurationZScore, 
##     data = natprefdata_zscores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1392 -0.5740  0.0752  0.6685  2.0129 
## 
## Coefficients:
##                Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)      0.2713     0.0372     7.3     0.00000000000051 ***
## NatCategoryLow  -0.6029     0.0554   -10.9 < 0.0000000000000002 ***
## DurationZScore  -0.0387     0.0276    -1.4                 0.16    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.954 on 1197 degrees of freedom
## Multiple R-squared:  0.0909, Adjusted R-squared:  0.0894 
## F-statistic: 59.8 on 2 and 1197 DF,  p-value: <0.0000000000000002
#Three categories
quantile(natprefdata_zscores$NaturalZScore, probs = c(.33,.66))
##      33%      66% 
## -0.70092  0.90189
hist(natprefdata_zscores$NaturalZScore)

#An interesting use of 'case_when'
natprefdata_zscores <- natprefdata_zscores %>% 
  mutate(NatCategory3 = case_when(
    NaturalZScore > .9 ~ "High",
    NaturalZScore < -.7 ~ "Low",
    TRUE ~ "Medium"
  ))

natprefdata_zscores$NatCategory3 <- factor(natprefdata_zscores$NatCategory3, levels = c("Low", "Medium", "High"))

lm_categorical3 <- lm(PreferenceZScore ~ NatCategory3 + DurationZScore, data=natprefdata_zscores)
summary(lm_categorical3)
## 
## Call:
## lm(formula = PreferenceZScore ~ NatCategory3 + DurationZScore, 
##     data = natprefdata_zscores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0670 -0.5883  0.0526  0.6912  2.0252 
## 
## Coefficients:
##                    Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)         -0.4047     0.0478   -8.46 < 0.0000000000000002 ***
## NatCategory3Medium   0.5180     0.0685    7.56     0.00000000000008 ***
## NatCategory3High     0.6875     0.0669   10.28 < 0.0000000000000002 ***
## DurationZScore      -0.0393     0.0276   -1.42                 0.16    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.956 on 1196 degrees of freedom
## Multiple R-squared:  0.0875, Adjusted R-squared:  0.0853 
## F-statistic: 38.3 on 3 and 1196 DF,  p-value: <0.0000000000000002

Notice that with a categorical predictor, R's output doesn't include all three categories: each beta is interpreted as the difference in the DV between that category and the reference (omitted) category, which is absorbed into the intercept. This is how R handles dummy coding: one category is left out.

ggplot(data=natprefdata_zscores) + 
  aes(x=Duration, y = Preference, color=NatCategory3) + 
  geom_jitter(width = .2) + 
  geom_smooth(method="lm", formula = "y ~ x")

Both IVs as categorical

natpref_categorical <- natprefdata_zscores %>% 
  mutate(Duration_cat = as.factor(Duration)) %>% 
  dplyr::select(Subject, Image, Preference, NatCategory3, Duration_cat)

lm_categorical_both <- lm(Preference ~ NatCategory3 + Duration_cat, data=natpref_categorical)
summary(lm_categorical_both)
## 
## Call:
## lm(formula = Preference ~ NatCategory3 + Duration_cat, data = natpref_categorical)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.652 -0.508  0.043  0.596  1.734 
## 
## Coefficients:
##                    Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          4.7766     0.0524   91.09 < 0.0000000000000002 ***
## NatCategory3Medium   0.4483     0.0591    7.58    0.000000000000067 ***
## NatCategory3High     0.5922     0.0576   10.28 < 0.0000000000000002 ***
## Duration_cat2       -0.0747     0.0583   -1.28                 0.20    
## Duration_cat3       -0.0829     0.0583   -1.42                 0.16    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.824 on 1195 degrees of freedom
## Multiple R-squared:  0.0879, Adjusted R-squared:  0.0848 
## F-statistic: 28.8 on 4 and 1195 DF,  p-value: <0.0000000000000002
confint(lm_categorical_both)
##                       2.5 %   97.5 %
## (Intercept)         4.67371 4.879481
## NatCategory3Medium  0.33230 0.564252
## NatCategory3High    0.47921 0.705258
## Duration_cat2      -0.18909 0.039793
## Duration_cat3      -0.19729 0.031535

More than two features

#Merging other image features 
imagefeatures <- read_csv("RegressionImageData.csv")
## Parsed with column specification:
## cols(
##   Image = col_double(),
##   EdgeDensity = col_double(),
##   Order = col_double(),
##   Natural = col_double(),
##   Habitable = col_double(),
##   Preference = col_double()
## )
#Here we don't want Preference or Naturalness from the new data because they are already in the current data set, so we are dropping them. Note: the MASS package also has a select function that masks dplyr's, so here I'm specifying that I want dplyr's version
imagefeatures <- imagefeatures %>% dplyr::select(-c(Preference, Natural))

#Check out the different join options in the help. Joins use a common variable (in this case our Image column) to merge data. left_join is used here because we want to keep all rows from the first data set and match rows from the second data set onto them.
#Can also do: natprefdata_2 <- merge(natprefdata, imagefeatures)
natprefdata_2 <- natprefdata %>% left_join(imagefeatures)
## Joining, by = "Image"
#Now we can run the new lm
morevariables_lm <- lm(Preference ~ Natural + Order + Habitable + Duration, data = natprefdata_2)
summary(morevariables_lm)
## 
## Call:
## lm(formula = Preference ~ Natural + Order + Habitable + Duration, 
##     data = natprefdata_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0841 -0.3903 -0.0113  0.4393  1.6832 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   0.3900     0.1618    2.41               0.016 *  
## Natural       0.4091     0.0150   27.31 <0.0000000000000002 ***
## Order         0.2744     0.0235   11.67 <0.0000000000000002 ***
## Habitable     0.3445     0.0192   17.92 <0.0000000000000002 ***
## Duration     -0.0198     0.0225   -0.88               0.378    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.635 on 1195 degrees of freedom
## Multiple R-squared:  0.458,  Adjusted R-squared:  0.456 
## F-statistic:  253 on 4 and 1195 DF,  p-value: <0.0000000000000002

Multicollinearity

We can check the Variance Inflation Factor (VIF) for this model: VIF_j = 1/(1 - R^2_j), where R^2_j is the R-squared from regressing predictor j on all the other predictors.

Rules of Thumb:

A VIF of 1 means the predictor is uncorrelated with the others; no sign of multicollinearity

A VIF of 4 means 75% of that predictor's variance is explained by the other predictors – a potential sign of multicollinearity that needs more investigation

A VIF of 10 means 90% of its variance is explained by the other predictors – serious multicollinearity!
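In the spirit of the by-hand calculations above, here's a sketch of the auxiliary regression behind a VIF, using synthetic correlated predictors (not the tutorial data):

```r
# VIF "by hand": regress one predictor on the others, then 1/(1 - R^2)
set.seed(1)
x1 <- rnorm(300)
x2 <- 0.8 * x1 + rnorm(300, sd = 0.5)  # x2 is strongly correlated with x1
x3 <- rnorm(300)

r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1  # well above 1, because x2 carries much of x1's variance
```

Running vif() on a model containing x1, x2, and x3 as predictors reproduces this number.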

#library(car)

vif(morevariables_lm)
##   Natural     Order Habitable  Duration 
##    2.7762    1.3467    3.3080    1.0032

Model Selection

There are other packages and methods for model selection in R; this is the easiest one to use in my opinion.

#library(MASS) #Already loaded as a dependency of ppcor
# Fit the full model 
full.model <- lm(Preference ~., data = natprefdata_2[,4:8])
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both", 
                      trace = FALSE)
summary(step.model)
## 
## Call:
## lm(formula = Preference ~ Natural + Order + Habitable, data = natprefdata_2[, 
##     4:8])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0660 -0.3913 -0.0141  0.4375  1.6624 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   0.3471     0.1542    2.25               0.025 *  
## Natural       0.4093     0.0150   27.34 <0.0000000000000002 ***
## Order         0.2742     0.0235   11.66 <0.0000000000000002 ***
## Habitable     0.3452     0.0192   17.97 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.635 on 1196 degrees of freedom
## Multiple R-squared:  0.458,  Adjusted R-squared:  0.457 
## F-statistic:  337 on 3 and 1196 DF,  p-value: <0.0000000000000002
#Forward selection. Note: starting from the full model, forward selection has
#nothing left to add, so it returns the full model unchanged; true forward
#selection starts from an intercept-only model with a 'scope' argument.
step.forward <- stepAIC(full.model, direction = "forward", trace = FALSE)
summary(step.forward)
## 
## Call:
## lm(formula = Preference ~ Natural + EdgeDensity + Order + Habitable, 
##     data = natprefdata_2[, 4:8])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0367 -0.3933 -0.0066  0.4430  1.6489 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   0.2751     0.1724    1.60                0.11    
## Natural       0.4054     0.0155   26.10 <0.0000000000000002 ***
## EdgeDensity   0.6898     0.7375    0.94                0.35    
## Order         0.2780     0.0239   11.65 <0.0000000000000002 ***
## Habitable     0.3423     0.0195   17.58 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.635 on 1195 degrees of freedom
## Multiple R-squared:  0.458,  Adjusted R-squared:  0.457 
## F-statistic:  253 on 4 and 1195 DF,  p-value: <0.0000000000000002
#Backward selection
step.backward <- stepAIC(full.model, direction = "backward", trace = FALSE)
summary(step.backward)
## 
## Call:
## lm(formula = Preference ~ Natural + Order + Habitable, data = natprefdata_2[, 
##     4:8])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0660 -0.3913 -0.0141  0.4375  1.6624 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   0.3471     0.1542    2.25               0.025 *  
## Natural       0.4093     0.0150   27.34 <0.0000000000000002 ***
## Order         0.2742     0.0235   11.66 <0.0000000000000002 ***
## Habitable     0.3452     0.0192   17.97 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.635 on 1196 degrees of freedom
## Multiple R-squared:  0.458,  Adjusted R-squared:  0.457 
## F-statistic:  337 on 3 and 1196 DF,  p-value: <0.0000000000000002

Logistic Regression

What if our DV were binary? Then we would need to run logistic regression. More on how to interpret this and other types of regression next quarter!

#Median split; note that values exactly equal to the median match neither
#case_when condition and become NA (here the split is a clean 600/600)
natprefdata_2 <- natprefdata_2 %>% 
  mutate(Pref_binary = case_when(
    Preference > median(Preference) ~ 1,
    Preference < median(Preference) ~ 0
  ))

natprefdata_2 %>% count(Pref_binary)
## # A tibble: 2 x 2
##   Pref_binary     n
##         <dbl> <int>
## 1           0   600
## 2           1   600
log_regression <- glm(Pref_binary ~ Natural + Order + Habitable + Duration, data = natprefdata_2, family = "binomial")
summary(log_regression)
## 
## Call:
## glm(formula = Pref_binary ~ Natural + Order + Habitable + Duration, 
##     family = "binomial", data = natprefdata_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0493  -0.9198   0.0326   0.8787   2.6255  
## 
## Coefficients:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -10.6241     0.8388  -12.67 <0.0000000000000002 ***
## Natural       1.1210     0.0754   14.86 <0.0000000000000002 ***
## Order         0.1605     0.1008    1.59                0.11    
## Habitable     1.1705     0.0869   13.47 <0.0000000000000002 ***
## Duration     -0.1324     0.0826   -1.60                0.11    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1663.6  on 1199  degrees of freedom
## Residual deviance: 1300.4  on 1195  degrees of freedom
## AIC: 1310
## 
## Number of Fisher Scoring iterations: 4
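As a quick preview (full interpretation comes next quarter), exponentiating a logistic regression coefficient gives an odds ratio. A minimal sketch, worked by hand from the Natural coefficient above:

```r
#exp(coef(log_regression)) would give odds ratios for every term in the model.
#Worked by hand for Natural (beta = 1.1210):
odds_ratio_natural <- exp(1.1210)
odds_ratio_natural
#About 3.07: each one-point increase in Natural roughly triples the odds of an
#above-median preference rating, holding the other predictors constant.
```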

Transforming data

I had us calculate Z-scores manually above, but R has a built-in function for that: scale(). Z-scoring is a linear transformation, so it doesn't change the shape of your distribution. Sometimes you may also want to do non-linear transformations, which can help make skewed data more normally distributed.

x <- rexp(100, rate = .1)
hist(x)

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.373   3.511   8.088  10.355  12.925  48.328
xZ <- scale(x)
summary(xZ)
##        V1        
##  Min.   :-1.060  
##  1st Qu.:-0.727  
##  Median :-0.241  
##  Mean   : 0.000  
##  3rd Qu.: 0.273  
##  Max.   : 4.034
hist(xZ)

#Note - if you have 0s (where log(0) is -Inf), log1p(x) will accurately calculate log(x + 1), even for very small x

logx <- log(x)
hist(logx)

#You can also Z-score after log transformation to get all variables on a similar scale. Sometimes this makes interpretation more difficult, so I usually don't do this if the variable is my DV.
logxZ <- scale(logx)
hist(logxZ)

Interpreting betas after different transformations:

If DV log-transformed:

log(DV) = beta*IV

Do: (exp(beta) - 1)*100

This equals the percent change in DV for one unit change in IV

Example: beta=.198, (exp(.198) - 1)*100 ≈ 22% change in DV
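Checking that arithmetic in R:

```r
beta <- 0.198
(exp(beta) - 1) * 100
#About 21.9, i.e. roughly a 22% increase in DV per one-unit increase in IV
```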

If IV log-transformed

DV = beta*log(IV)

Do: beta/100

This approximates the unit change in DV for a 1% change in IV (since log(1.01) ≈ 0.01)

OR

Do: For an x percent increase in IV, multiply the coefficient by log(1 + x/100).

Example: beta=.198. For every 10% increase in the independent variable, the dependent variable increases by about 0.198 * log(1.10) ≈ 0.02 units
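The same two calculations in R:

```r
beta <- 0.198
beta / 100        #approximate unit change in DV for a 1% increase in IV
beta * log(1.10)  #about 0.019: unit change in DV for a 10% increase in IV
```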

More info here: https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/

If DV and IV are both z-scored

scale(DV) = beta*scale(IV)

A 1-SD change in IV corresponds to a beta-SD change in DV (i.e., beta is a standardized coefficient)
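A minimal sketch with simulated data (not the tutorial's variables) showing that z-scoring both sides turns the slope into a standardized beta:

```r
set.seed(2)  #simulated data purely for illustration
iv <- rnorm(200)
dv <- 0.5 * iv + rnorm(200)

m <- lm(scale(dv) ~ scale(iv))
coef(m)[2]  #slope in SD units: a 1-SD increase in iv predicts this many SDs of dv
#For simple regression this standardized slope equals cor(iv, dv)
cor(iv, dv)
```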

Correlations

cor(natprefdata_2$Natural, natprefdata_2$Habitable)
## [1] -0.79465
cor(natprefdata_2$Natural, natprefdata_2$Habitable, method = "spearman")
## [1] -0.77314
cor.test(natprefdata_2$Natural, natprefdata_2$Habitable)
## 
##  Pearson's product-moment correlation
## 
## data:  natprefdata_2$Natural and natprefdata_2$Habitable
## t = -45.3, df = 1198, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.81461 -0.77281
## sample estimates:
##      cor 
## -0.79465
cor.test(natprefdata_2$Natural, natprefdata_2$Habitable, method = "spearman")
## Warning in cor.test.default(natprefdata_2$Natural, natprefdata_2$Habitable, :
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  natprefdata_2$Natural and natprefdata_2$Habitable
## S = 510662977, p-value <0.0000000000000002
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##      rho 
## -0.77314

Partial and Semi-partial correlations

#library(ppcor)

#Regular correlation
cor.test(natprefdata_2$Natural, natprefdata_2$Preference)
## 
##  Pearson's product-moment correlation
## 
## data:  natprefdata_2$Natural and natprefdata_2$Preference
## t = 12, df = 1198, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.27601 0.37709
## sample estimates:
##     cor 
## 0.32749
#Partial correlation: between Naturalness and Preference controlling for Order
pcor.test(x=natprefdata_2$Natural, y=natprefdata_2$Preference, z=natprefdata_2$Order)
##   estimate    p.value statistic    n gp  Method
## 1  0.47812 1.6626e-69    18.834 1200  1 pearson
# Partial correlation between two variables controlling for all other variables in the data frame
pcor(natprefdata_2[,4:8])
## $estimate
##               Natural Preference EdgeDensity     Order Habitable
## Natural      1.000000   0.602566    0.195956 -0.051076  -0.82866
## Preference   0.602566   1.000000    0.027048  0.319396   0.45324
## EdgeDensity  0.195956   0.027048    1.000000 -0.169131   0.13390
## Order       -0.051076   0.319396   -0.169131  1.000000   0.22524
## Habitable   -0.828663   0.453243    0.133904  0.225242   1.00000
## 
## $p.value
##                 Natural  Preference       EdgeDensity      Order   Habitable
## Natural      0.0000e+00 3.3170e-119 0.000000000007977 7.7327e-02 1.9766e-303
## Preference  3.3170e-119  0.0000e+00 0.349798656876870 8.5951e-30  1.0830e-61
## EdgeDensity  7.9770e-12  3.4980e-01 0.000000000000000 3.9111e-09  3.3373e-06
## Order        7.7327e-02  8.5951e-30 0.000000003911125 0.0000e+00  3.1134e-15
## Habitable   1.9766e-303  1.0830e-61 0.000003337257625 3.1134e-15  0.0000e+00
## 
## $statistic
##              Natural Preference EdgeDensity   Order Habitable
## Natural       0.0000   26.10048     6.90789 -1.7679  -51.1763
## Preference   26.1005    0.00000     0.93535 11.6514   17.5772
## EdgeDensity   6.9079    0.93535     0.00000 -5.9321    4.6710
## Order        -1.7679   11.65143    -5.93213  0.0000    7.9917
## Habitable   -51.1763   17.57717     4.67097  7.9917    0.0000
## 
## $n
## [1] 1200
## 
## $gp
## [1] 3
## 
## $method
## [1] "pearson"
# Semi-partial correlation: between naturalness and preference, controlling for order only on naturalness. Note the argument order: the "control" variable z is partialed out of the y variable only
spcor.test(x=natprefdata_2$Preference, y=natprefdata_2$Natural, z=natprefdata_2$Order)
##   estimate    p.value statistic    n gp  Method
## 1  0.45169 2.4891e-61    17.516 1200  1 pearson
#This would get the semi-partial correlation between naturalness and preference while holding order constant for preference
spcor.test(x=natprefdata_2$Natural, y=natprefdata_2$Preference, z=natprefdata_2$Order)
##   estimate    p.value statistic    n gp  Method
## 1  0.45446 3.7282e-62    17.651 1200  1 pearson

Other R tools

Corrplot package - graphical representations of correlation matrices https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

library(corrplot)
source("cor.mtest.R")

cormat <- cor(natprefdata_2[,4:8])
cormat
##              Natural Preference EdgeDensity    Order Habitable
## Natural      1.00000    0.32749     0.26622 -0.31065  -0.79465
## Preference   0.32749    1.00000     0.11821  0.32787   0.06615
## EdgeDensity  0.26622    0.11821     1.00000 -0.18351  -0.15223
## Order       -0.31065    0.32787    -0.18351  1.00000   0.49035
## Habitable   -0.79465    0.06615    -0.15223  0.49035   1.00000
A <- cor.mtest(natprefdata_2[,4:8], .95)
corrplot(cormat, method="number", type = "lower", p.mat = A[[1]], sig.level = .001, tl.col = "black")

pairs(natprefdata_2[,4:8])

Stargazer package - outputs tables of models for papers https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf

library(stargazer)

stargazer(multreg1, multreg_interaction1, type="text")
## 
## ==============================================================================
##                                             Dependent variable:               
##                              -------------------------------------------------
##                                              PreferenceZScore                 
##                                        (1)                      (2)           
## ------------------------------------------------------------------------------
## NaturalZScore                        0.329***                 0.329***        
##                                      (0.027)                  (0.027)         
##                                                                               
## DurationZScore                        -0.041                   -0.041         
##                                      (0.027)                  (0.027)         
##                                                                               
## NaturalZScore:DurationZScore                                   -0.015         
##                                                               (0.027)         
##                                                                               
## Constant                              -0.000                   0.0005         
##                                      (0.027)                  (0.027)         
##                                                                               
## ------------------------------------------------------------------------------
## Observations                          1,200                    1,200          
## R2                                    0.109                    0.109          
## Adjusted R2                           0.107                    0.107          
## Residual Std. Error             0.945 (df = 1197)        0.945 (df = 1196)    
## F Statistic                  73.154*** (df = 2; 1197) 48.846*** (df = 3; 1196)
## ==============================================================================
## Note:                                              *p<0.1; **p<0.05; ***p<0.01
stargazer(multreg1, multreg_interaction1, type="text", intercept.bottom = FALSE, intercept.top = TRUE, ci = TRUE, single.row = T)
## 
## ==============================================================================
##                                             Dependent variable:               
##                              -------------------------------------------------
##                                              PreferenceZScore                 
##                                        (1)                      (2)           
## ------------------------------------------------------------------------------
## Constant                      -0.000 (-0.053, 0.053)   0.0005 (-0.053, 0.054) 
## NaturalZScore                0.329*** (0.275, 0.382)  0.329*** (0.275, 0.382) 
## DurationZScore                -0.041 (-0.094, 0.013)   -0.041 (-0.095, 0.013) 
## NaturalZScore:DurationZScore                           -0.015 (-0.068, 0.038) 
## ------------------------------------------------------------------------------
## Observations                          1,200                    1,200          
## R2                                    0.109                    0.109          
## Adjusted R2                           0.107                    0.107          
## Residual Std. Error             0.945 (df = 1197)        0.945 (df = 1196)    
## F Statistic                  73.154*** (df = 2; 1197) 48.846*** (df = 3; 1196)
## ==============================================================================
## Note:                                              *p<0.1; **p<0.05; ***p<0.01

Lots of other R packages are available for creating nicer tables as well: knitr (kable()), apaTables, and formattable, for example.
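For example, knitr::kable() will format any matrix or data frame, including a model's coefficient table (shown here on a built-in dataset rather than the tutorial's data):

```r
library(knitr)

m <- lm(mpg ~ wt + hp, data = mtcars)  #toy model from the built-in mtcars data
kable(summary(m)$coefficients, digits = 3)
```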